    Contributions of temporal encodings of voicing, voicelessness, fundamental frequency, and amplitude variation to audiovisual and auditory speech perception

    Auditory and audio-visual speech perception was investigated using auditory signals with an invariant spectral envelope that temporally encoded the presence of voiced and voiceless excitation, variations in amplitude envelope, and F0. In experiment 1, the contribution of the timing of voicing to consonant identification was compared with the additional effects of variations in F0 and in the amplitude of voiced speech. In audio-visual conditions only, amplitude variation slightly increased accuracy overall and for manner features. F0 variation slightly increased overall accuracy and manner perception in both auditory and audio-visual conditions. Experiment 2 examined the consonant information derived from the presence and amplitude variation of voiceless speech, in addition to that from voicing, F0, and voiced-speech amplitude. A binary indication of voiceless excitation improved accuracy overall and for voicing and manner. The amplitude variation of voiceless speech produced only a small increment in place-of-articulation scores. A final experiment examined audio-visual sentence perception using encodings of voiceless excitation and amplitude variation added to a signal representing voicing and F0. Amplitude variation contributed to sentence perception, but voiceless excitation did not. The timing of voiced and voiceless excitation appears to provide the major temporal cues to consonant identity.
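
    The stimuli described here are "speech pattern" signals: a fixed spectral envelope carrying only temporal cues. As a rough illustration, the sketch below is an assumption, not the authors' synthesis procedure (the function name, carrier choice, and filter values are invented): it builds such a signal from per-frame voicing decisions, an F0 track, and an amplitude track, using a square-wave carrier for voiced frames, noise for voiceless frames, and one fixed shaping filter so the spectral envelope never varies.

```python
import numpy as np
from scipy.signal import lfilter

def encode_temporal_cues(voiced, f0, amp, sr=16000, frame=0.01):
    """voiced: per-frame bool; f0: per-frame Hz; amp: per-frame linear gain."""
    n = int(frame * sr)                      # samples per frame
    out = np.zeros(len(voiced) * n)
    phase = 0.0
    t = np.arange(n) / sr
    for i, (v, f, a) in enumerate(zip(voiced, f0, amp)):
        if v:                                # voiced: periodic carrier at F0
            seg = np.sign(np.sin(phase + 2 * np.pi * f * t))
            phase = (phase + 2 * np.pi * f * n / sr) % (2 * np.pi)
        else:                                # voiceless: noise excitation
            seg = np.random.randn(n)
        out[i * n:(i + 1) * n] = a * seg     # amplitude-envelope cue
    # fixed first-order lowpass as a stand-in for the invariant spectral envelope
    return lfilter([0.05], [1.0, -0.95], out)
```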

    A mixed inventory structure for German concatenative synthesis

    In speech synthesis by unit concatenation, a central issue is the definition of the unit inventory. Diphone and demisyllable inventories are widely used, but both unit types have their drawbacks. This paper describes a mixed inventory structure that is syllable-oriented but does not demand a definite decision about the position of a syllable boundary. In defining the inventory, the results of a comprehensive investigation of coarticulatory phenomena at syllable boundaries were used, as well as a machine-readable pronunciation dictionary. An evaluation comparing the mixed inventory with a demisyllable and a diphone inventory confirms that speech generated with the mixed inventory is superior in general acceptance. A segmental intelligibility test shows the high intelligibility of the synthetic speech.
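
    To make the mixed-inventory idea concrete, here is a toy sketch of unit lookup with fallback: larger syllable-oriented units are preferred, with smaller units (demisyllable- or diphone-like) as fallbacks stored in the same inventory. The keys, the greedy longest-match strategy, and the four-phone cap are illustrative assumptions, not the paper's actual selection scheme.

```python
from typing import Dict, List

def select_units(phones: List[str], inventory: Dict[str, str]) -> List[str]:
    """Greedily cover a phone sequence with the longest available units."""
    units, i = [], 0
    while i < len(phones):
        # try the longest candidate unit first, then shorter fallbacks
        for span in range(min(4, len(phones) - i), 0, -1):
            key = "-".join(phones[i:i + span])
            if key in inventory:
                units.append(inventory[key])
                i += span
                break
        else:
            raise KeyError(f"no unit covers {phones[i]}")
    return units

# e.g. select_units(["h", "a", "l", "o"],
#                   {"h-a-l": "unit_hal.wav", "o": "unit_o.wav"})
```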

    Does training with amplitude modulated tones affect tone-vocoded speech perception?

    Temporal-envelope cues are essential for successful speech perception. We asked here whether training on stimuli containing temporal-envelope cues but no speech content can improve the perception of spectrally degraded (vocoded) speech, in which the temporal envelope (but not the temporal fine structure) is largely preserved. Two groups of listeners were trained on different amplitude-modulation (AM) tasks, either AM detection or AM-rate discrimination (21 blocks of 60 trials over two days, 1260 trials; modulation frequencies: 4 Hz, 8 Hz, and 16 Hz), while an additional control group did not undertake any training. Consonant identification in vocoded vowel-consonant-vowel stimuli was tested before and after training on the AM tasks (or at an equivalent time interval for the control group). Following training, only the trained groups showed a significant improvement in the perception of vocoded speech, but the improvement did not significantly differ from that observed for controls. Thus, we find no convincing evidence that this amount of training with temporal-envelope cues but no speech content provides a significant benefit for vocoded speech intelligibility. Alternative training regimens using vocoded speech along the linguistic hierarchy should be explored.
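
    For context, a tone vocoder replaces each analysis band of the speech signal with a pure tone modulated by that band's temporal envelope, preserving envelope cues while discarding temporal fine structure. Below is a minimal sketch; the band edges, filter order, and envelope extraction are plausible assumptions rather than the study's actual parameters.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def tone_vocode(x, sr, edges=(100, 400, 1000, 2400, 6000)):
    """Resynthesize x from band envelopes imposed on sine carriers."""
    t = np.arange(len(x)) / sr
    y = np.zeros(len(x))
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(4, [lo, hi], btype="band", fs=sr, output="sos")
        band = sosfiltfilt(sos, x)            # analysis band
        env = np.abs(hilbert(band))           # temporal envelope of the band
        fc = np.sqrt(lo * hi)                 # carrier at the geometric mean
        y += env * np.sin(2 * np.pi * fc * t)
    return y / np.max(np.abs(y))              # peak-normalise
```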

    Hemispheric Asymmetries in Speech Perception: Sense, Nonsense and Modulations

    Background: The well-established left-hemisphere specialisation for language processing has long been claimed to rest on a low-level auditory specialisation for specific acoustic features of speech, particularly 'rapid temporal processing'.
    Methodology: A novel analysis/synthesis technique was used to construct a variety of sounds based on simple sentences, which could be manipulated in spectro-temporal complexity and in whether or not they were intelligible. All sounds consisted of two noise-excited spectral prominences (based on the lower two formants of the original speech) which could be static or could vary in frequency and/or amplitude independently. Dynamically varying both acoustic features based on the same sentence led to intelligible speech, but when either or both features were static the stimuli were not intelligible. Using the frequency dynamics of one sentence with the amplitude dynamics of another led to unintelligible sounds of spectro-temporal complexity comparable to the intelligible ones. Positron emission tomography (PET) was used to compare which brain regions were active when participants listened to the different sounds.
    Conclusions: Neural activity to spectral and amplitude modulations sufficient to support speech intelligibility (without actually being intelligible) was seen bilaterally, with a right temporal lobe dominance. A left-dominant response was seen only to intelligible sounds. It thus appears that the left-hemisphere specialisation for speech is based on the linguistic properties of utterances, not on particular acoustic features.
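
    As a rough sketch of the stimulus construction (frame-based filtering is an assumed realisation; the paper used its own analysis/synthesis technique), each prominence can be generated as bandpass-filtered noise whose centre frequency and amplitude follow per-frame tracks that are either static or derived from a sentence:

```python
import numpy as np
from scipy.signal import butter, sosfilt

def noise_prominence(freq_track, amp_track, sr=16000, frame=0.01, bw=0.25):
    """One spectral prominence: bandpass-filtered noise whose centre
    frequency and amplitude are set frame by frame."""
    n = int(frame * sr)
    out = np.zeros(len(freq_track) * n)
    for i, (f, a) in enumerate(zip(freq_track, amp_track)):
        sos = butter(2, [f * (1 - bw), f * (1 + bw)], btype="band",
                     fs=sr, output="sos")
        out[i * n:(i + 1) * n] = a * sosfilt(sos, np.random.randn(n))
    return out

# A two-prominence stimulus is the sum of two independent tracks, e.g.:
# stim = noise_prominence(f1, a1) + noise_prominence(f2, a2)
```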

    Sensitivity of the human auditory cortex to acoustic degradation of speech and non-speech sounds

    The perception of speech is usually an effortless and reliable process, even in highly adverse listening conditions. Besides interference from external sound sources, the intelligibility of speech can be reduced by degradation of the structure of the speech signal itself, for example by digital compression of sound. This kind of distortion may be even more detrimental to intelligibility than external distortion, since the auditory system cannot use sound-source-specific acoustic features, such as spatial location, to separate the distortion from the speech signal. The perceptual consequences of acoustic distortions for speech intelligibility have been studied extensively. However, the cortical mechanisms of speech perception in adverse listening conditions are not well understood, particularly when the speech signal itself is distorted. The aim of this thesis was to investigate the cortical mechanisms underlying speech perception in conditions where speech is less intelligible due to external distortion or digital compression. In the four studies of this thesis, the intelligibility of speech was varied either by digital compression or by the addition of stochastic noise, and cortical activity related to the speech stimuli was measured using magnetoencephalography (MEG). The results indicated that degrading speech sounds by digital compression enhanced the evoked responses originating from the auditory cortex, whereas adding stochastic noise did not modulate the cortical responses. Furthermore, when the distortion was presented continuously in the background, the transient activity of the auditory cortex was delayed as the intensity of the distortion increased. On the perceptual level, digital compression reduced the comprehensibility of speech more than additive stochastic noise. It was also demonstrated that prior knowledge of speech content substantially enhanced the intelligibility of distorted speech, and this perceptual change was associated with increased cortical activity in several regions adjacent to the auditory cortex. In conclusion, the results of this thesis show that the auditory cortex is very sensitive to the acoustic features of the distortion, while at later processing stages several cortical areas reflect the intelligibility of speech. These findings suggest that the auditory system adapts rapidly to the variability of the auditory environment and can efficiently use prior knowledge of speech content in deciphering acoustically degraded speech signals.
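
    One of the two degradations, additive stochastic noise, can be sketched as follows. This is a generic illustration of mixing Gaussian noise at a target signal-to-noise ratio, not the thesis' exact noise type or levels:

```python
import numpy as np

def add_stochastic_noise(speech, snr_db):
    """Degrade speech with Gaussian noise at a target SNR in dB."""
    noise = np.random.randn(len(speech))
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # scale noise so that 10*log10(p_speech / p_noise_scaled) = snr_db
    noise *= np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + noise
```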

    The Natural Statistics of Audiovisual Speech

    Humans, like other animals, are exposed to a continuous stream of signals that are dynamic, multimodal, extended, and time-varying in nature. This complex input space must be transduced and sampled by our sensory systems and transmitted to the brain, where it can guide the selection of appropriate actions. To simplify this process, it has been suggested that the brain exploits statistical regularities in the stimulus space. Tests of this idea have largely been confined to unimodal signals and natural scenes. One important class of multisensory signals for which a quantitative characterization of the input space is unavailable is human speech. We do not understand which signals the brain must actively piece together from an audiovisual speech stream to arrive at a percept, versus what is already embedded in the signal structure of the stream itself. In essence, we lack a clear understanding of the natural statistics of audiovisual speech. In the present study, we identified the following major statistical features of audiovisual speech. First, we observed robust correlations and close temporal correspondence between the area of the mouth opening and the acoustic envelope. Second, we found the strongest correlation between the area of the mouth opening and vocal tract resonances. Third, we observed that both the area of the mouth opening and the voice envelope are temporally modulated in the 2–7 Hz frequency range. Finally, we show that the timing of mouth movements relative to the onset of the voice is consistently between 100 and 300 ms. We interpret these data in the context of recent neural theories of speech which suggest that speech communication is a reciprocally coupled, multisensory event, whereby the outputs of the signaler are matched to the neural processes of the receiver.
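
    Two of these measurements are straightforward to sketch: the zero-lag correlation between a mouth-area track and the acoustic envelope, and the fraction of modulation power falling in the 2–7 Hz band. The function below is an illustrative assumption (the paper's actual pipeline, lag analysis, and spectral estimator differ), and it assumes both tracks are pre-extracted and share one sampling rate:

```python
import numpy as np

def audiovisual_stats(mouth_area, envelope, fs):
    """mouth_area and envelope: equally sampled tracks at rate fs (Hz)."""
    r = np.corrcoef(mouth_area, envelope)[0, 1]     # zero-lag correlation
    spec = np.abs(np.fft.rfft(envelope - envelope.mean())) ** 2
    freqs = np.fft.rfftfreq(len(envelope), d=1.0 / fs)
    in_band = spec[(freqs >= 2) & (freqs <= 7)].sum() / spec.sum()
    return r, in_band   # correlation; fraction of modulation power in 2-7 Hz
```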